258 research outputs found

    Mining Representative Unsubstituted Graph Patterns Using Prior Similarity Matrix

    One of the most powerful techniques for studying protein structures is to look for recurrent fragments (also called substructures or spatial motifs) and then use them as patterns to characterize the proteins under study. An emergent trend consists of parsing proteins' three-dimensional (3D) structures into graphs of amino acids. Hence, the search for recurrent spatial motifs is formulated as a process of frequent subgraph discovery where each subgraph represents a spatial motif. In this scope, several efficient approaches for frequent subgraph discovery have been proposed in the literature. However, the set of discovered frequent subgraphs is too large to be efficiently analyzed and explored in any further process. In this paper, we propose a novel pattern selection approach that shrinks the large number of discovered frequent subgraphs by selecting the representative ones. Existing pattern selection approaches do not exploit domain knowledge. In our approach, by contrast, we incorporate the evolutionary information of amino acids defined in the substitution matrices in order to select the representative subgraphs. We show the effectiveness of our approach on a number of real datasets. The results of our experiments show that our approach is able to considerably decrease the number of motifs while enhancing their interestingness.

    Combining Clustering techniques and Formal Concept Analysis to characterize Interestingness Measures

    Formal Concept Analysis (FCA) is a data analysis method that enables the discovery of hidden knowledge in data. One kind of hidden knowledge extracted from data is association rules. Different quality measures have been reported in the literature to extract only relevant association rules. Given a dataset, the choice of a good quality measure remains a challenging task for a user. Given an evaluation matrix of quality measures according to semantic properties, this paper describes how FCA can highlight quality measures with similar behavior in order to help the user make this choice. The aim of this article is to discover clusters of Interestingness Measures (IM) that can validate those found by the hierarchical and partitioning clustering methods AHC and k-means. Then, based on the theoretical study of sixty-one interestingness measures according to nineteen properties, proposed in a recent study, FCA describes several groups of measures. Comment: 13 pages, 2 figures
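
    As a rough illustration of how FCA groups objects that share properties, the sketch below enumerates the formal concepts of a tiny binary context. The measure names (m1 to m3), the property names (p1 to p3), and the helper functions are hypothetical and are not taken from the paper; the brute-force enumeration over attribute subsets is only practical for very small contexts.

```python
from itertools import combinations

# Hypothetical toy context: each measure is mapped to the properties it satisfies.
CONTEXT = {
    "m1": {"p1", "p2"},
    "m2": {"p1", "p2"},
    "m3": {"p2", "p3"},
}


def extent(attrs, context):
    """Objects that have every attribute in attrs."""
    return frozenset(o for o, owned in context.items() if attrs <= owned)


def intent(objs, context):
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    if not objs:
        return frozenset(set().union(*context.values()))
    return frozenset(set.intersection(*(set(context[o]) for o in objs)))


def concepts(context):
    """Enumerate all formal concepts by closing every subset of attributes."""
    all_attrs = sorted(set().union(*context.values()))
    found = set()
    for size in range(len(all_attrs) + 1):
        for combo in combinations(all_attrs, size):
            ext = extent(frozenset(combo), context)
            found.add((ext, intent(ext, context)))
    return found


if __name__ == "__main__":
    for ext, itt in sorted(concepts(CONTEXT), key=lambda c: (len(c[0]), sorted(c[0]))):
        print(sorted(ext), sorted(itt))
```

    In this toy context, m1 and m2 fall into the same concept because they satisfy exactly the same properties, which is the sense in which FCA can expose measures with similar behavior.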

    Categorization of interestingness measures for knowledge extraction

    Finding interesting association rules is an important and active research field in data mining. The algorithms of the Apriori family are based on two rule extraction measures, support and confidence. Although these two measures have the virtue of being algorithmically fast, they generate a prohibitive number of rules, most of which are redundant and irrelevant. It is therefore necessary to use further measures that filter out uninteresting rules. Many survey studies have been carried out on interestingness measures from several points of view. Different reported studies have sought to identify "good" properties of rule extraction measures, and these properties have been assessed on 61 measures. The purpose of this paper is twofold: first, to extend the number of measures and properties to be studied, in addition to formalizing the properties proposed in the literature; second, in the light of this formal study, to categorize the studied measures. This paper thus identifies categories of measures in order to help users efficiently select one or more appropriate measures during the knowledge extraction process. The evaluation of the properties on the 61 measures has enabled us to identify 7 classes of measures, which we obtained using two different clustering techniques. Comment: 34 pages, 4 figures
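
    For readers unfamiliar with the two measures named above, the following minimal sketch computes support and confidence using their standard definitions. The transaction data and the function names are illustrative assumptions, not material from the paper.

```python
# Illustrative transactions (made-up data); each transaction is a set of items.
TRANSACTIONS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    lhs = frozenset(antecedent)
    both = lhs | frozenset(consequent)
    lhs_support = support(lhs, transactions)
    if lhs_support == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(both, transactions) / lhs_support


if __name__ == "__main__":
    print(support({"bread", "milk"}, TRANSACTIONS))
    print(confidence({"bread"}, {"milk"}, TRANSACTIONS))
```

    Both quantities only require counting transactions, which is why they are cheap to compute but say little about whether a rule is redundant or surprising.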

    A scalable mining of frequent quadratic concepts in d-folksonomies

    Folksonomy mining is attracting the interest of the web 2.0 community since folksonomies represent the core data of social resource sharing systems. However, a scrutiny of the related work on mining folksonomies reveals that the time stamp dimension has not been considered. For example, the large number of works dedicated to mining tri-concepts from folksonomies did not take the time dimension into account. In this paper, we consider a folksonomy commonly composed of triples and add time as a new dimension. We motivate our approach by highlighting the range of potential applications. Then, we present the foundations for mining quadri-concepts, provide a formal definition of the problem, and introduce a new efficient algorithm, called QUADRICONS, that solves it and thereby allows folksonomies to be mined in time, i.e., d-folksonomies. We also introduce a new closure operator that splits the induced search space into equivalence classes whose smallest elements are the quadri-minimal generators. Experiments carried out on large-scale real-world datasets show that our algorithm performs well.

    Towards an Efficient Discovery of the Topological Representative Subgraphs

    With the emergence of graph databases, the task of frequent subgraph discovery has been extensively addressed. Although the approaches proposed in the literature have made this task feasible, the number of discovered frequent subgraphs is still too high to be efficiently used in any further exploration. Feature selection for graph data is a way to reduce the high number of frequent subgraphs based on exact or approximate structural similarity. However, current structural similarity strategies are not efficient enough in many real-world applications, and the combinatorial nature of graphs makes them computationally very costly. In order to select a smaller yet structurally irredundant set of subgraphs, we propose a novel approach that mines the top-k topological representative subgraphs among the frequent ones. Our approach allows detecting hidden structural similarities that existing approaches are unable to detect, such as the density or the diameter of the subgraph. In addition, it can be easily extended using any user-defined structural or topological attributes depending on the sought properties. Empirical studies on real and synthetic graph datasets show that our approach is fast and scalable.
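
    The abstract names density and diameter as examples of topological attributes. The sketch below shows how these two descriptors can be computed for a small undirected, connected graph; the adjacency-dictionary representation, the example graphs, and the function names are assumptions made for illustration, and the code does not reproduce the paper's selection procedure.

```python
from collections import deque

# Hypothetical example graphs as adjacency dictionaries (undirected, no self-loops).
PATH = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
TRIANGLE = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}


def density(adj):
    """Edge density 2m / (n (n - 1)) of an undirected simple graph."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(neighbours) for neighbours in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def diameter(adj):
    """Longest shortest-path length, found by a breadth-first search from every node."""
    longest = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != len(adj):
            raise ValueError("graph must be connected")
        longest = max(longest, max(dist.values()))
    return longest


if __name__ == "__main__":
    for name, graph in (("path", PATH), ("triangle", TRIANGLE)):
        print(name, density(graph), diameter(graph))
```

    Describing each subgraph by a short vector of such attributes turns the comparison of subgraphs into a comparison of fixed-length vectors, which avoids pairwise structural matching between graphs.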

    Multiple instance learning for sequence data with across bag dependencies

    In the Multiple Instance Learning (MIL) problem for sequence data, the instances inside the bags are sequences. In some real-world applications such as bioinformatics, comparing a random pair of sequences makes no sense. In fact, each instance may have structural and/or functional relations with instances of other bags. Thus, the classification task should take this across-bag relation into account. In this work, we present two novel MIL approaches for sequence data classification named ABClass and ABSim. ABClass extracts motifs from related instances and uses them to encode sequences. A discriminative classifier is then applied to compute a partial classification result for each set of related sequences. ABSim uses a similarity measure to discriminate the related instances and to compute a score matrix. For both approaches, an aggregation method is applied in order to generate the final classification result. We applied both approaches to the problem of predicting bacterial Ionizing Radiation Resistance. The experimental results of the presented approaches are satisfactory.

    Problem-Solving Knowledge Mining from Users’ Actions in an Intelligent Tutoring System

    In an intelligent tutoring system (ITS), the domain expert should provide relevant domain knowledge to the tutor so that it will be able to guide the learner during problem solving. However, in several domains, this knowledge is not predetermined and should be captured or learned from expert users as well as intermediate and novice users. Our hypothesis is that knowledge discovery (KD) techniques can help to build this domain intelligence in ITS. This paper proposes a framework to capture problem-solving knowledge using a promising approach of data and knowledge discovery based on a combination of sequential pattern mining and association rules discovery techniques. The framework has been implemented and is used to discover new meta knowledge and rules in a given domain which then extend domain knowledge and serve as problem space allowing the intelligent tutoring system to guide learners in problem-solving situations. Preliminary experiments have been conducted using the framework as an alternative to a path-planning problem solver in CanadarmTutor.

    Towards a semantic and statistical selection of association rules

    The increasing growth of databases raises an urgent need for more accurate methods to better understand the stored data. In this scope, association rules have been extensively used for the analysis and comprehension of huge amounts of data. However, the number of generated rules is too large to be efficiently analyzed and explored in any further process. Association rule selection is a classical way to address this issue, yet new, innovative approaches are required in order to help decision makers. Hence, many interestingness measures have been defined to statistically evaluate and filter the association rules. However, these measures present two major problems. On the one hand, they do not allow eliminating irrelevant rules; on the other hand, their abundance leads to heterogeneous evaluation results, which causes confusion in decision making. In this paper, we propose a two-winged approach to select statistically interesting and semantically incomparable rules. Our statistical selection helps discover interesting association rules without favoring or excluding any measure. The semantic comparability helps to decide whether the considered association rules are semantically related, i.e., comparable. The outcomes of our experiments on real datasets show promising results in terms of reduction in the number of rules.

    Protein sequences classification by means of feature extraction with substitution matrices

    Background: This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors to enable using well-known machine-learning classifiers which require this format. However, designing a suitable feature space for a set of proteins is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step.
    Results: In order to demonstrate the efficiency of this approach, we compare several encoding methods using some machine learning classifiers. The experimental results showed that our encoding method outperforms the other ones in terms of classification accuracy and number of generated attributes. We also compared the classifiers in terms of accuracy. Results indicated that SVM generally outperforms the other classifiers with any encoding method. We showed that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied the effect of varying the substitution matrix on the quality of our method and hence on the classification quality. We noticed that our method enables good classification accuracies with all the substitution matrices and that the variances of the accuracies obtained using various substitution matrices are slight. However, the number of generated features varies from one substitution matrix to another. Furthermore, the use of already published datasets allowed us to carry out a comparison with several related works.
    Conclusions: The outcomes of our comparative experiments confirm the efficiency of our encoding method to represent protein sequences in classification tasks.